{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic PUMS Analysis with OpenDP\n",
    "\n",
    "This notebook is a brief tutorial on doing data analysis within the OpenDP system.\n",
    "\n",
    "We start out by establishing what is considered public knowledge about the data.\n",
    "Generally speaking, it is better to wait to load your data until you are ready to invoke the function.\n",
    "This helps keep us honest while we are writing data preprocessors,\n",
    "because it is tempting to look at the data to help fill missing parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# establish public information\n",
    "col_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n",
    "# the greatest number of records that any one individual can influence in the dataset\n",
    "max_influence = 1\n",
    "# we can also reasonably intuit that age and income will be numeric,\n",
    "#     as well as bounds for them, without looking at the data\n",
    "age_bounds = (0, 100)\n",
    "income_bounds = (0, 150_000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Any time you collect non-DP information from the private data to build your measurement,\n",
    "the privacy budget will be underestimated.\n",
    "\n",
    "The vetting process is currently underway for the code in the OpenDP Library.\n",
    "Any constructors that have not been vetted may still be accessed if you opt-in to \"contrib\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from opendp.mod import enable_features\n",
    "enable_features('contrib')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Working with CSV data\n",
    "Let's examine how we can process csv data with Transformations.\n",
    "\n",
    "I'm going to pull a few constructors from the [Dataframes section in the user guide](https://docs.opendp.org/en/stable/user/transformation-constructors.html#dataframe).\n",
    "\n",
    "We start with `make_split_dataframe` to parse one large string containing all the csv data into a dataframe.\n",
    "`make_split_dataframe` expects us to pass column names, which we can grab out of the public information.\n",
    "`make_select_column` will then index into the dataframe to pull out a column where the elements have a given type `TOA`.\n",
    "The `TOA` argument won't cast for you; the casting comes later!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from opendp.trans import make_split_dataframe, make_select_column\n",
    "income_preprocessor = (\n",
    "    # Convert data into a dataframe where columns are of type Vec<str>\n",
    "    make_split_dataframe(separator=\",\", col_names=col_names) >>\n",
    "    # Selects a column of df, Vec<str>\n",
    "    make_select_column(key=\"income\", TOA=str)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the sake of exposition, we're going to go ahead and load up the data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "59,1,9,1,0,1\n",
      "31,0,1,3,17000,0\n",
      "36,1,11,1,0,1\n",
      "54,1,11,1,9100,1\n",
      "39,0,5,3,37000,0\n",
      "34,0,9,1,0,1\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')\n",
    "\n",
    "with open(data_path) as input_file:\n",
    "    data = input_file.read()\n",
    "\n",
    "print('\\n'.join(data.split('\\n')[:6]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As we can see from the first few rows, it is intentional that there are no column names in the data.\n",
    "If your data has column names, you will want to strip them out before passing data into your function.\n",
    "\n",
    "Now if you run the transformation on the data, you will get a list of incomes as strings.\n",
    "I've limited the output to just the first few income values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n",
      "['0', '17000', '0', '9100', '37000', '0']\n"
     ]
    }
   ],
   "source": [
    "transformed = income_preprocessor(data)\n",
    "print(type(transformed))\n",
    "print(transformed[:6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Casting\n",
    "Income doesn't make sense as a string for our purposes,\n",
    "so we can just extend the previous preprocessor to also cast and impute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0, 17000, 0, 9100, 37000, 0]\n"
     ]
    }
   ],
   "source": [
    "from opendp.trans import make_cast, make_impute_constant\n",
    "\n",
    "# make a transformation that casts from a vector of strings to a vector of ints\n",
    "cast_str_int = (\n",
    "    # Cast Vec<str> to Vec<Option<int>>\n",
    "    make_cast(TIA=str, TOA=int) >>\n",
    "    # Replace any elements that failed to parse with 0, emitting a Vec<int>\n",
    "    make_impute_constant(0)\n",
    ")\n",
    "\n",
    "# replace the previous preprocessor: extend it with the caster\n",
    "income_preprocessor = income_preprocessor >> cast_str_int\n",
    "print(income_preprocessor(data)[:6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now we have integer income data from our CSV.\n",
    "A quick aside, keep in mind that we can invoke transformations on almost anything to do some testing.\n",
    "For example, we still have a handle on ``cast_str_int``, don't we?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 0, 2]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cast_str_int([\"null\", \"1.\", \"2\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Private Count\n",
    "Time to compute our first aggregate statistic.\n",
    "Suppose we want to know the number of records in the dataset.\n",
    "\n",
    "We can use the [list of aggregators](https://docs.opendp.org/en/stable/user/transformation-constructors.html#aggregators)\n",
    "in the Transformation Constructors section of the user guide to find `make_count`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from opendp.trans import make_count\n",
    "count = income_preprocessor >> make_count(TIA=int)\n",
    "# NOT a DP release!\n",
    "count_response = count(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Be careful!\n",
    "`count` is still only a transformation,\n",
    "so the output in `count_response` is not a differentially private release.\n",
    "You will need to chain with a measurement to create a differentially private release.\n",
    "\n",
    "We use `make_base_geometric` below, because `make_base_geometric` has an integer support.\n",
    "Notice that we now import from `opendp.meas`, and the resulting `type(dp_count)` is `Measurement`. This tells us that the output will be a differentially private release."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from opendp.meas import make_base_geometric\n",
    "dp_count = count >> make_base_geometric(scale=1.)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "In any realistic situation, you would likely want to estimate the budget utilization before you make a release.\n",
    "Use a search utility to quantify the privacy expenditure of this release."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DP count budget: 1.0000000009313226\n",
      "DP count: 1000\n"
     ]
    }
   ],
   "source": [
    "from opendp.mod import binary_search\n",
    "\n",
    "# estimate the budget...\n",
    "epsilon = binary_search(\n",
    "    lambda eps: dp_count.check(d_in=max_influence, d_out=eps),\n",
    "    bounds=(0., 100.))\n",
    "print(\"DP count budget:\", epsilon)\n",
    "\n",
    "# ...and then release\n",
    "count_release = dp_count(data)\n",
    "print(\"DP count:\", count_release)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Private Sum\n",
    "\n",
    "Suppose we want to know the total income of our dataset.\n",
    "First, take a look at [the list of aggregators](https://docs.opendp.org/en/stable/user/transformation-constructors.html#aggregators).\n",
    "`make_bounded_sum` seems to meet our requirements.\n",
    "As indicated by the function's API documentation, it expects bounded data,\n",
    "so we'll also need to chain the transformation from `make_clamp` with the `income_preprocessor`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from opendp.trans import make_clamp, make_bounded_sum\n",
    "bounded_income_sum = (\n",
    "    income_preprocessor >>\n",
    "    # Clamp income values\n",
    "    make_clamp(bounds=income_bounds) >>\n",
    "    # These bounds must be identical to the clamp bounds, otherwise chaining will fail\n",
    "    make_bounded_sum(bounds=income_bounds)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "In this example, instead of just passing a scale into `make_base_geometric`,\n",
    "lets say I want whatever scale will make my measurement 1-epsilon DP.\n",
    "Again, I can use a search utility to find such a scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from opendp.mod import binary_search_param\n",
    "\n",
    "discovered_scale = binary_search_param(\n",
    "    lambda s: bounded_income_sum >> make_base_geometric(scale=s),\n",
    "    d_in=max_influence,\n",
    "    d_out=1.)\n",
    "\n",
    "dp_sum = bounded_income_sum >> make_base_geometric(scale=discovered_scale)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or more succinctly..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DP sum: 30255829\n"
     ]
    }
   ],
   "source": [
    "from opendp.mod import binary_search_chain\n",
    "\n",
    "dp_sum = binary_search_chain(\n",
    "    lambda s: bounded_income_sum >> make_base_geometric(scale=s),\n",
    "    d_in=max_influence,\n",
    "    d_out=1.)\n",
    "\n",
    "# ...and make our 1-epsilon DP release\n",
    "print(\"DP sum:\", dp_sum(data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Private Mean\n",
    "\n",
    "We may be more interested in the mean age.\n",
    "Just like before, we reference the docs to find `make_sized_bounded_mean`.\n",
    "The name of the constructor indicates that it expects sized, bounded data,\n",
    "and the docstring points us toward preprocessors we can use.\n",
    "\n",
    "Sized data is data that has a known number of rows.\n",
    "The constructor enforces this requirement\n",
    "because knowledge of the dataset size was necessary to establish the privacy proof.\n",
    "\n",
    "Luckily, we've already made a DP release of the number of rows in the dataset,\n",
    "which we can reuse as an argument here.\n",
    "\n",
    "Putting the previous sections together, our bounded mean age is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intermediate domains don't match. See https://github.com/opendp/opendp/discussions/297\n",
      "    The structure of the intermediate domains are the same, but the types or parameters differ.\n",
      "    shared_domain: VectorDomain(AllDomain())\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from opendp.trans import make_cast_default, make_bounded_resize, make_sized_bounded_mean\n",
    "from opendp.mod import OpenDPException\n",
    "\n",
    "try:\n",
    "    mean_age_preprocessor = (\n",
    "        # Convert data into a dataframe of string columns\n",
    "        make_split_dataframe(separator=\",\", col_names=col_names) >>\n",
    "        # Selects a column of df, Vec<str>\n",
    "        make_select_column(key=\"age\", TOA=str) >>\n",
    "        # Cast the column as Vec<float>, and fill nulls with the default value, 0.\n",
    "        make_cast_default(TIA=str, TOA=float) >>\n",
    "        # Clamp age values\n",
    "        make_clamp(bounds=age_bounds)\n",
    "    )\n",
    "except OpenDPException as err:\n",
    "    assert err.message.startswith(\"Intermediate domains don't match.\")\n",
    "    print(err.message)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Wait a second! The intermediate domains don't match?\n",
    "In this case, we casted to float-valued data, but `make_clamp` was built with integer-valued bounds,\n",
    "so the clamp is expecting integer data.\n",
    "Therefore, the output of the cast is not a valid input to the clamp.\n",
    "We can fix this by adjusting the bounds and trying again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "float_age_bounds = tuple(map(float, age_bounds))\n",
    "\n",
    "mean_age_preprocessor = (\n",
    "    # Convert data into a dataframe of string columns\n",
    "    make_split_dataframe(separator=\",\", col_names=col_names) >>\n",
    "    # Selects a column of df, Vec<str>\n",
    "    make_select_column(key=\"age\", TOA=str) >>\n",
    "    # Cast the column as Vec<float>, and fill nulls with the default value, 0.\n",
    "    make_cast_default(TIA=str, TOA=float) >>\n",
    "    # Clamp age values\n",
    "    make_clamp(bounds=float_age_bounds) >>\n",
    "    # Resize the dataset to length `count_release`.\n",
    "    #     If there are fewer than `count_release` rows in the data, fill with a constant of 20.\n",
    "    #     If there are more than `count_release` rows in the data, only keep `count_release` rows\n",
    "    make_bounded_resize(size=count_release, bounds=float_age_bounds, constant=20.) >>\n",
    "    # Compute the mean\n",
    "    make_sized_bounded_mean(size=count_release, bounds=float_age_bounds)\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "I stopped just short of chaining with a measurement because we're working with float data.\n",
    "There are some extra considerations to take in mind with floating-point data,\n",
    "which are [covered in the user guide](https://docs.opendp.org/en/latest/user/measurement-constructors.html#floating-point).\n",
    "\n",
    "With the assumption that you understand the ramifications, I'll go ahead and finish the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DP mean: 44.303365954161585\n"
     ]
    }
   ],
   "source": [
    "from opendp.meas import make_base_laplace\n",
    "from opendp.mod import enable_features\n",
    "enable_features(\"floating-point\")\n",
    "# add laplace noise\n",
    "dp_mean = mean_age_preprocessor >> make_base_laplace(scale=1.0)\n",
    "\n",
    "mean_release = dp_mean(data)\n",
    "print(\"DP mean:\", mean_release)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Depending on your use-case, you may find greater utility separately releasing a DP sum and a DP count,\n",
    "and then postprocessing them into the mean.\n",
    "In the above mean example, you could even take advantage of this to avoid using floating-point numbers.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Private Variance\n",
    "\n",
    "This is the last quick example, just to show a complete computation chain.\n",
    "In this example, I chain with the gaussian mechanism instead, with a budget of .1 epsilon, 1e-5 delta."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DP variance: 41.24415287647741\n"
     ]
    }
   ],
   "source": [
    "from opendp.meas import make_base_gaussian\n",
    "\n",
    "variance = (\n",
    "    # Convert data into a dataframe of string columns\n",
    "    make_split_dataframe(separator=\",\", col_names=col_names) >>\n",
    "    # Selects a column of df, Vec<str>\n",
    "    make_select_column(key=\"age\", TOA=str) >>\n",
    "    # Cast the column as Vec<float>, and fill nulls with the default value, 0.\n",
    "    make_cast_default(TIA=str, TOA=float) >>\n",
    "    # Clamp age values\n",
    "    make_clamp(bounds=float_age_bounds) >>\n",
    "    # Resize the dataset to length `count_release`.\n",
    "    make_bounded_resize(size=count_release, bounds=float_age_bounds, constant=20.) >>\n",
    "    # Compute the mean\n",
    "    make_sized_bounded_mean(size=count_release, bounds=float_age_bounds)\n",
    ")\n",
    "\n",
    "dp_variance = binary_search_chain(\n",
    "    lambda s: variance >> make_base_gaussian(scale=s),\n",
    "    d_in=max_influence,\n",
    "    d_out=(.1, 1e-5) # (epsilon, delta)-DP\n",
    ")\n",
    "\n",
    "print(\"DP variance:\", dp_variance(data))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}